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Abstract 

/  r  'I  •  :  j/  .  .  “  '  i  •  « 

>  show  that  the  greatest  common  divisor  of  two  n-bit  integers  (given 
in  the  usual  binary  representation)  can  be  computed  in  time  0(n)  on  a 
linear  array  of  0(n)  identical  systolic  cells,  each  of  which  is  a  finite- 
state  machine  with  connections  to  its  nearest  neighbours. 
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1.  Introduction 


We  consider  the  problem  of  computing  the  greatest  common  divisor  GCD 
(a,b)  of  two  positive  integers  a  and  b  *  This  problem  arises  when 
performing  exact  rational  arithmetic  during  symbolic  and  algebraic 
computations,  in  factorisation  [4,  8],  etc.  For  serial  computation  the 
classical  Euclidean  algorithm  and  its  variants  have  been  thoroughly 
analysed  and  their  behaviour  is  well  understood  [1,  11,  14].  In  this  paper 
we  consider  parallel  algorithms  which  could  readily  be  implemented  in 
hardware.  We  assume  that  a  and  b  are  represented  in  binary  in  the 
usual  way. 

The  integer  GCD  problem  is  superficially  similar  to  the  problem  of 
computing  a  GCD  of  two  polynomials  [2,  5,  6,  17].  The  significant 
difference  between  integer  and  polynomial  GCD  computations  is  that  carries 
have  to  be  propagated  In  the  former,  but  not  in  the  latter. 

Since  our  aim  is  to  obtain  parallel  algorithms  which  can  readily  be 
implemented  with  VLSI  technology  [10,  13],  we  do  not  permit  the  most 
general  "unbounded  parallelism"  model  of  [2].  Instead,  we  restrict  our 
attention  to  algorithms  which  can  be  implemented  on  a  linear  array  of 
0(n)  "systolic  processors"  or  "cells"  with  nearest-neighbour 
communication.  Such  "systolic  algorithms"  have  been  surveyed  in  [7,  12]. 

In  Sections  4-6  we  show  that  GCD  (a,b)  can  be  computed  in  time  0(n)  on 
a  linear  array  of  0(n)  cells,  where  each  cell  is  a  finite-state  machine 
which  could  be  Implemented  using  a  special  purpose  VLSI  chip  or  a 
mlcroprogrammable  chip  such  as  the  "PSC"  chip  designed  at  Carnegle-Mellon 
University  [9]. 


\  ' 
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This  paper  is  a  companion  paper  Co  [6],  where  Che  easier  problem  of 
polynomial  GCD  compucacion  (over  a  finice  field)  is  considered.  It  is 
recommended  chat  the  reader  unfamiliar  with  systolic  algorithms  should  read 
[6]  or  Sections  1-2  of  [7]  before  reading  Sections  4-6  of  this  paper. 

In  Section  2  we  consider  various  serial  algorithms  for  integer  GCD 
computation.  The  classical  Euclidean  algorithm  [11]  and  the  binary 
Euclidean  algorithm  [3,  8,  15]  do  not  appear  to  lead  to  good  parallel 
algorithms,  but  a  new  algorithm  (PM),  which  may  be  regarded  as  a  variant 
of  the  binary  Euclidean  algorithm  and  is  described  in  Section  3,  does  lead 
to  a  good  parallel  algorithm.  In  Section  4  we  consider  the  systolic 
implementation  of  algorithm  PM,  and  in  Section  5  we  give  an  upper  bound  on 
the  number  of  systolic  cells  required.  A  lower  bound  and  some  empirical 
results  are  given  in  Section  6.  Finally,  a  possible  VLSI  implementation  is 
discussed  in  Section  7. 


2.  The  classical  and  binary  Euclidean  algorithms 


Assume  that  a  and  b  are  positive.  The  classical  Euclidean 
algorithm  [11]  may  be  written  as: 


while  b  i1  0 


’  , 

b 

a  mod  b, 


GCD 


s  • 


This  is  simple,  but  not  attractive  for  a  systolic  implementation  because 
the  inner  loop  includes  the  division  operation  "a  mod  b"  which  takes  time 
0(n)  .  The  "binary"  Euclidean  algorithm  [3,  11,  15]  avoids  divisions,  so 
is  superficially  more  attractive.  The  binary  Euclidean  algorithm  may  be 


written  as: 
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{assume  a,  b  odd} 
t  |a-b|; 
while  t  f  0  do 
begin 

repeat  t  r*  t  div  2  until  odd  (t); 
if  a  >  b  then  a  :■  t  else  b  :■  t; 
t  :*  |a-b| 
end; 

GCD  i •  a  • 

Figure  1:  The  binary  Euclidean  Algorithm  B^  (usual  version) 


An  alternative  which  avoids  the  use  of  an  auxiliary  variable  t  and 
the  absolute  value  function  is  given  in  Figure  2. 


{assume  a  odd,  b  ^  0} 
while  b  i  0  do 
begin 

while  even  (b)  do  b  :*  b  div  2; 
if  a  >  b  then  a  <->  b; 
b  : -  b-a 
end; 

GCD  :«■  a  . 

Figure  2:  The  binary  Euclidean  Algorithm  B2  (alternative  version) 


It  is  easy  to  verify  that  Algorithm  with  initial  (a,b)  *  (min  (a,tT), 
|a-b|)  generates  the  same  sequence  as  Algorithm  with  initial 
(a,b)  ■  (a,b).  Conversely,  Algorithm  with  initial  (a,b)  *•  (a+b,a) 
gives  the  same  sequence  as  Algorithm  B^  with  initial  (a,b)  *  (a,b). 

Hence,  the  two  algorithms  are  essentially  equivalent. 

The  inner  loops  of  algorithms  B^  and  B^  involve  the  comparison 
"if  a  b",  and  in  the  worst  case  this  takes  time  H(n)  because  the  result 
depends  on  all  bits  in  the  binary  representations  of  a  and  b  . 


{  ' 
A 


_ 
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To  prove  convergence  of  algorithms  B^  and  B^,  it  is  sufficient  to 

3 

note  that  a  +  b  decreases  by  a  factor  ^  at  each  iteration,  because  the 

Vf 

larger  of  a  and  b  is  replaced  by  |a-b|/2  for  some  k  >  1  . 


Algorithm  PM 


We  attempt  to  modify  the  binary  Euclidean  algorithm  so  that  operations 
in  the  inner  loop  depend  only  on  the  low-order  bits  of  a  and  b  (and 
hence  the  inner  loop  can  be  pipelined  on  a  systolic  array).  Our  first 
attempt  is  algorithm  B^. 


(assume  a  odd,  |a|  <  2n,  |b|  <  2n} 
a  n;  { I  a  I  <  2a} 

6  n;  (|b|  <  26} 

while  b  j*  0  do 
begin 

while  even  (b)  do 
begin 

b  :«  b  div  2;  g  :«  g  -  1 
end; 

(odd(b),  |b|  <  26} 

LO  :  if  a  >_  6  then 
begin 

a  <->  b;  o  <->  8 
end; 

(odd(a) ,  odd(b) ,  |a|  <  2®,  ibt  <  26,  0  <  g} 

LI  :  if  even  ((a+b)  div  2)  then  b  :■  (a+b)  div  2 

else  b  :■  (a-b)  div  2 


end; 


GCD  :■  la |  . 


Figure  3  :  Algorithm  B^:  a  first  attempt 
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The  Idea  of  algorithm  is  to  maintain  integers  a  and  £  such  that 
|a|  _<  2°,  |b|  <.2®,  and  replace  the  test  "a  b"  in  algorithm  hy  the 
weaker  but  more  easily  implemented  test  "a  >_  6".  (For  details  of  how 
this  test  is  implemented  see  Section  4  below;  at  present  we  merely 
note  that  a  and  g  are  number  with  O(log  n)  bits  whereas  a  and  b 
have  0(n)  bits.) 


It  is  tempting  to  replace  statement  LI  by  the  simpler 

"L2  :  b  : -  b-a". 

However,  the  algorithm  would  then  fail  to  converge  for  certain  initial 
data,  e.g.  a  *  1,  b  ■  3.  The  problem  is  that  a  _<  6  does  not  imply  a  b, 

Q  6 

so  the  variables  b  and  a  may  become  negative,  then  |a|  2  ,  |b|  <_  2^ 

g 

and  o  6  before  L2  does  not  imply  | b  1  <_  2  after  L2.  Changing  the 
test  ''4  >  0"  to  "  a  >  0"  at  statement  LO  does  not  help  (try  a  ■  3, 
b  ■  1).  We  do  have  convergence  of  the  algorithm  given  in  Figure  3  because 


1  b  |  < 


2°  +  2l 


<  2 


after  statement  LI,  and  since  b  is  even  after  LI,  a  +  0  decreases  by 
each  time  (except  possibly  the  first)  that  the  inner  loop  is  executed. 
Hence  the  inner  loop  can  be  executed  at  most  2n  +  1  times. 


Another  alternative  is  to  replace  statement  LI  by 

"L3 :  b  :•  |b-a|". 

With  this  modification  the  algorithm  converges,  but  the  inner  loop  is 
difficult  to  pipeline  because  of  the  absolute  value  function  occurring  at 


i. 


8 


statement  L3.  As  given  in  Figure  3  the  inner  loop  is  easy  to  pipeline 
because  the  operations  on  a  and  b  depend  only  on  the  two  least 
significant  bits  of  their  2's  complement  binary  representations. 


Observe  that  the  two  variables  a  and  6  in  algorithm  are 
unnecessary:  only  their  difference  5  ■  a  -  6  is  required.  Using  this 
fact  and  allowing  for  the  possibility  of  a  being  even,  we  get  the  algorithm 
PM  (for  "plus-minus").  The  comments  involving  a  and  3  are  Included  only 
to  show  the  relationship  to  Algorithm  Bj. 


LO: 


LI: 

L2: 


L3: 


{assume  a,  b  not  both  zero} 

{a  :«  n;  B  :»  n}  g 

6:-  0;  {6  -  a-  6  ,  |a|  <  2  ,  |b|  <  2P} 

m  :»  1; 

while  even  (a)  and  even  (b)  do 
begin 

a  :■  a  div  2;  b  :«■  b  div  2;  {a  :»a-l,  6  :* 

m  :=  2*m 

end; 

if  even  (a)  then  a  <->  b; 

{now  a  odd,  |a|  <  2°,  |b|  <,  2P} 
while  b  j*  0  do 
begin 

while  even  (b)  do 
begin 

b  :■  b  div  2;  <5  :«6+l  {8  :_8-l} 
end; 

if  6  _>  0  then 
begin 

a  <->  b;  6  :-  -6  {a  <->  6} 
end; 

if  even  ((b+a)  div  2)  then  b  :*  (b+a)  div  2 
else  b  :■  (b-a)  div  2 

end; 

G  :•  m*a.  {G  ■  +  GCD} 


B  -1} 


Figure  4:  Algorithm  PM 
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4.  Systolic  implementation  of  algorithm  PM 

Consider  now  che  implementation  of  algorithm  PM  on  a  systolic  array. 

We  assume  that  a  and  b  are  represented  as  (2's  complement)  binary 
Integers  which  enter  the  array  least-significant  bits  first.  The  initial 
"while"  loop  LO  and  the  final  multiplication  by  m  are  easily  implemented: 
essentially  the  array  just  transmits  unchanged  the  low-order  zero  bits  of 
a  and  b  . 

There  is  no  need  for  the  array  to  check  the  termination  criterion: 
once  b  becomes  zero  the  systolic  cells  to  the  right  will  merely  implement 
the  "while"  loop  L2  (b  :■  b  div  2;  6  :•  6+1)  and  transmit  a  to  the 
right. 

In  the  inner  loop  LI  the  essential  tests  involving  a  and  b  depend 
only  on  the  low-order  bits  of  a  and  b  .  Hence,  a  cell  can  perform  these 
tests  before  the  high-order  bits  of  a  and  b  reach  it  via  cells  to  its 
left. 


It  only  remains  to  consider  the  implementation  of  the  operations  on  6 
in  algorithm  PM.  The  only  operations  required  are  "6  :*  <5+1",  "6  :»  -6", 
and  "if  6  0  ...".  Rather  than  represent  S  in  binary,  we  choose  a 

"sign  and  magnitude  unary"  representation,  i.e.  keep  sign  (6)  and  |6| 
separate,  and  represent  e  -  |<5|  in  unary  as  the  distance  between  1-bits 
in  two  streams  of  bits.  With  this  representation  all  required  operations 
on  6  can  be  pipelined.  For  example,  the  operation  "6  :■  6+1"  merely 


1 
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involves  shifting  one  bit  stream  relative  to  the  other  and  possibly 
complementing  the  sign  bit. 

These  considerations  lead,  after  some  straightforward  optimizations, 
to  the  systolic  cell  illustrated  in  Figure  5.  The  cell  has  six  input 


ain  - ► 

CD 

waitj 

bin  - ► 

m 

rshift  - 

startin - ► 

( start] 

| carry |  - 

startoddin — *■ 

startoddj 

swap 

epsin  - ► 

1  epsj 

[eps2|  - 

negin  - ► 

Inegj 

j  minus |  - 

aout 

bout 

startout 

startoddout 

epsout 

negout 


Figure  5;  Systolic  cell  for  integer  GCD  computation 


streams  (each  one  bit  wide),  and  six  output  streams  which  are  connected  to 
the  corresponding  input  streams  of  the  cell  to  the  right  .  The  input 
streams  are  a in  and  bin  for  the  bits  in  the  2's  complement  binary 
representation  of  the  numbers  a  and  b  (least  significant  bit  first), 
startin  to  indicate  the  least  significant  bit  of  a  ,  and  three  additional 
streams  startoddin,  epsln  and  negln  which  should  be  all  zero  on  input  to 
the  leftmost  cell,  startoddin  is  used  to  indicate  the  least  significant  1- 
bit  of  a  and  b  (so  the  distance  between  1  bits  in  the  startin  and 
startoddin  streams  is  log2<n  ,  where  m  is  the  highest  power  of  2  in 
GCO  (a,b),  as  in  Figure  4).  epsln  and  negin  are  used  to  represent 

— _ _ _ _  • 


11 


e  ■  |6|  and  sign  (-6)  respectively  (e  is  represented  as  the  distance 
between  1-bits  in  the  epsin  and  startoddin  streams). 

The  cell  has  twelve  internal  state  bits:  one  for  each  of  the  six 
inputs,  and  six  additional  bits  (wait ,  shift ,  carry,  swap,  eps2  and  minus). 
The  wait  bit  is  on  if  the  cell  is  waiting  for  the  first  nonzero  bit  in  the 
binary  representations  of  a  and  b  .  The  shift  bit  signifies  that  the 
cell  is  shifting  the  bits  of  b  right  faster  than  those  of  a  (i.e. 
implementing  "b  :=*  b  div  2").  The  carry  bit  has  the  obvious  meaning  for 
a  cell  performing  the  serial  binary  operation  a+b  or  a-b.  The  swap 
bit  signifies  that  a  and  b  are  to  be  interchanged  (and  the  sign  of  5 
reversed);  to  save  cells  the  interchange  may  be  combined  with  a  shift. 
eps2  is  used  to  save  a  bit  on  the  epsin  stream  (so  the  operations 
e  :■  e-1  and  e  e+1  can  be  implemented  by  shifting  the  epsin  stream 
either  faster  or  slower  than  the  "normal”  speed  of  one  cell  per  two 
cycles).  Finally,  minus  indicates  whether  subtraction  or  addition  is  to 
be  performed  at  statement  L3.  A  definition  of  the  systolic  cell  is  given 
in  Appendix  A. 
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5.  Upper  bounds  on  the  number  of  systolic  cells  required 

In  this  section  we  show  that  the  systolic  array  described  in  Section 
4  will  correctly  compute  +  GCD  (a,b)  for  n-bit  integers  a  and  b, 
provided  that  the  number  of  cells  in  the  array  is  at  least  r3.1105n1. 

First  we  prove  a  weaker  result* 

Theorem  1 

4n  cells  as  defined  in  Appendix  A  are  sufficient  to  compute  +  GCD  (a,b) 
for  any  n-bit  integers  a  and  b  . 

Proof 


Consider  the  systolic  implementation  of  Algorithm  PM.  The  statements 
"a  :=  div  2;  b  :■  b  div  2;  {a  :*  a-1;  8  :=  8-1} 
are  implemented  by  one  systolic  cell.  Similarly  for  che  statements 
"b  :*  b  div  2;  6  :=  6+1  {6  :*  6-1} 


and  the  statement 

"L3:  if  even  ((a+b)  div  2)  then  b  :*  (a+b)  div  2 
else  b  :■  (a-b)  div  2". 

Suppose  these  statements  are  executed  pp  P2  and  p^  times 
(respectively)  before  b  becomes  zero.  This  requires  p  *  P]+P2+P3 
cells.  Other  statements  such  as 
"if  6  >  0  then 
begin 
a  <->  b; 

<5  -6  {a  <->  6} 

end" 

~  ^ - - - -  - ... _ _ _ .  .  -A.*.- 
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may  be  disregarded  because  they  are  implemented  by  cells  which  have  been 
counted  already. 

At  statement  Li,  if  b  0  we  have  6-a-6<.0,0»‘|a|£2a, 

|b|  <_  2^  .  Hence  8  >.  a  >.  0  ,  80  <*+8>0. 

Now  2n-(a+g)  *  2p^+P2  anc*  ?2  Z  P3~^  (since  b  is  even 
after  execution  of  statement  L3).  Thus 

p  *  P1  +  p2  +  p3 

<  Px  +  2p2  +  1 

<  2(2Pl  +  p2)  +  1 

<  4n  -  2(a+8)  +  1 

<  4n  +  1, 

i.e.  4n  +  1  cells  suffice.  However,  the  last  cell  implementing  statement 
L3  merely  sets  b  to  zero  (at  this  point  b  *  +  a)  and  does  not  change 
a  ,  so  4n  cells  would  suffice  to  compute  +  GCD.  This  completes  the 
proof  of  Theorem  1.  // 

Before  proving  a  stronger  result,  we  need  some  Lemmas. 

Lemma  1 

Let  p  -  1^g1 7  *  0.6404  and  l  <m  <k  .  Then 

(i)  1  +  p  -  4U2 

(ii)  V  >  2~2/3 

(ili)  (l+y-1)/2k+1  £  y(2k+m-l)/2 
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(iv) 

(v) 

(vi) 

(vii) 


2k+ta+l 

2-(k-hn)  +  (1+y-1)(1_4“m)/(3.2k+1)  <  u  2 

2k+m 


2k+m-l 

2-(k+m)  +  y-1(1_4-m)/(3.2k)  <  y  2 

2k-hn-2 

y-l/2k+m  +  (1_4“ni)/(3.2k)  <  y  2 


Proof 


(i)  is  easy.  For  (ii),  we  have 


w3  ”  13+5^77  13+5.4  JL_ 

128  128  4 


so 


V  >  2 


-2/3 


Note  that  *5  <  y  ,  so 

2m-k  <  yk-m. 


Thus,  it  is  sufficient  to  prove  (iii)  -  (vi)  for  k  »  m  ;  the  inequalities 
for  k  >  m  then  follow  by  multiplying  the  left  sides  by  2m-k  and  the 
right  sides  by  yk_m.  With  k  -  m,  (iii)  -  (vii)  reduce  to  (iii)'  -  (vii)' 

(iii) '  (l+y“1)/2m+1  ^  y(3m_1)/2  3mH 

(iv) '  2’2m  +  (l+y"l)(l-4"m)/(3.2m+1)  <  y  2 

(v) '  2“m<y3m/2 

3m- 1 


(vi) '  2_2m  +  y_1(l-4"®)/(3.2°)  _<  y  2 

(vii) '  2'2m  +  y(l-4“m)/(3.2n)  <  y3m/2 


A _ 1 _ i _ L 
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We  observe  that  (v)'  follows  from  (li),  (vi)'  follows  from  (ill)'  (since 

2-2m  <  2-(m+l)  an(j  (i-4*ni)/(3.2m)  <  2-(nrt'1)),  and  (vli)' 

3grt~l  3m 

follows  from  (iv)'  (since  p  <.  j —  and  p  2  <  p  2  ).  Hence 

it  is  sufficient  to  prove  (iii)'  and  (iv)'.  When  m  -  1  ,  (iii)'  and  (iv)' 

reduce  to  equalities 

(iii) "  (l+p"1)  -  u  and 

(iv) "  1/4  +  ( 1-f-jj-1 ) / 1 6  -  p2  . 

For  m  >  1  ,  we  multiply  the  left  side  of  (iii)"  by  2^~m  and  the 
right  side  by  y3(m-l)/2  >  giving  (iii)'  (’s  <  w3^2  by  (ii)). 

Similarly,  from  (iv)"  we  get 

2-(m+l)  +  (l+p"l)/2nrt’3  p(3m+l)/2 

but  it  is  easy  to  show  (using  l-4*”m  <_  2(l-21_m)  and  P-*  <.  2)  that 

2-2n  +  (l+p‘1)(l-4‘m)/3.2nH'1 
<_  2"(m+l)  +  (l-pp”1)/2m+3, 

so  (iv)'  follows.  This  completes  the  proof  of  Lemma  1.  // 


Consider  Algorithm  P  defined  in  Figure  6.  Provided  the  initial 
condition 

I:  {<5-0,  ] a |  <  K/p,  |b|  <  K,  p  -  0} 
holds,  where  p  -  (l+»f7)/8  as  in  Lemma  1,  then  the  condition 
H:  {6-0  ->  |a I  |b|  <  Kpp/2} 

holds  at  all  points  Indicated  by  {H}  in  Figure  6. 
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UJ 

repeat 

LI: 

{H} 

if  6  >_  0  then 

L2: 

begin 

if  odd  (a)  then 

begin 

if  even  ((a+b)  div  2)  then  a  :*  (a+b)  div  2 

else  a  :■  (a-b)  div  2; 

P  :■  P+1 
end; 

while  a  f  0  and  even  (a)  do 
L3:  begin 

a  :■  a  div  2; 

P  :■  P+1; 

6  6-1 

L4:  {H} 

end 

end 

else  {  6  <  0} 

L5:  begin 

if  odd  (b)  then 
begin 

if  even  ((a+b)div  2)  then  b  :*  (b+a)div  2 

else  b  :*  (b-a)div  2; 

p  :■  p+1 

end; 

while  b  t  0  and  even  (b)  do 
L6:  begin 

b  :■  b  div  2; 

P  :■  P+1. 

6  6+1 

L7:  {H} 

end 

end 

until  (a«0)  or  (b“0). 

Figure  6  :  Algorithm  P 
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Proof 

Note  that  p  increases  monotonically  during  execution  of  algorithm 

* 

P.  The  proof  of  Lemma  2  is  by  induction  on  p  >_  0  •  If  p  »  0  we  have 

H  true  at  label  LI,  by  the  initial  condition  I. 

Note  that  6  decreases  monotonically  during  execution  of  the  block. 

L2,  then  increases  monotonically  during  execution  of  the  block  L5,  then 
decreases  monotonically  during  the  execution  of  the  block  l2,  etc.  Also,  H 
is  trivially  true  if  6  j  0  . 

Consider  an  execution  of  block  L5  such  that  6  »  0  at  statement  L7. 

Let  the  values  of  (a,  b,  p)  at  this  point  be  (a^,  b^,  p^).  We  shall 

show  that  H  holds,  i.e.  that 

P?/2-l  p?/2 

|a2l  £  K  u  1  and  Ib^l  £  K  u  1  . 

First,  suppose  the  previous  occurrence  of  6  ■  0  occurred  in  block  L2,  say 

when  the  value  of  (a,  b,  p)  at  L4  was  (aQ>  bQ»  Pq),  where 

PQ  <  P2  .  By  the  inductive  hypotheses, 

p— / 2—1  ,  Pr>/2 

la^l  £  K  p  0  and  IbgK  K  u  ®  . 

There  are  two  cases: 

(i)  Sq  odd,  and 

(ii)  even. 

Consider  case  (i)  first.  After  reaching  L4  with  p  ■  pQ ,  block  L2  is 

executed  once  more.  Thus,  at  L5  we  have,  for  some  k  >_  1  ,  (5,a,b,p)  - 

(fi^AifbjiPj),  where  -  -k,  pt  -  P0+k+l,  aL  -  (aQ  +  b0)/2k+1» 
and  b^  »  b^  .  Then  block  L5  is  executed  some  number  m  £  1 
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times  until  L7  is  reached  with  6  -  0  .  Thus,  there  is  a  sequence  of 
positive  integers  k^ . k  such  that 


and 

where 

and 

for  1  3  1,  « • . ,  m  • 


m 

Z  ki 
i-1  1 


k 

+  k  +  m 


> 


b' 

m 


<bi-i  ± 


^+1 


It  follows  that 


and 


U2I  <  U0l/2k+1  +  |b0l/2k+1 
<  K  yP°  (y_1+i)/2k+1 


<  K  V 


(pn+2k+m-l)/2 


by  Ind.  Hypothesis 
by  Lemma  1  (iii) 


K  y 


lb2l  <  lbjl/2 


P2/2-l 


1  1 


k +1  k  +l+k  ,+1 

+  I  a  j | (1/2  +  1/2  m  m_1  +  ...  +  i/21 


L(Ic  +1 ) 


<  'bol/2 


k+m 


<  lb  1/2 


<  K  y 


+  |a1l(l/4+l/42+...+l/4m) 

k+m 
Pq/2 


+  |a  |(l-4““)/3 


[ l/2k+m  +  (y'1+l)(l-4_m)/3.2k+1] 


by  Ind.  Hypothesis 


(p0+2k+m+l)/2 


P2/2 


<VLv 
<  K  w 

Thus,  Che  proof  for  case  (i)  is  complete. 


by  Lemma  1  (iv) 


Now  consider  case  (ii)  (aQ  even).  After  reaching  L4  with  p  ■  Pq 
and  6  ■  0  ,  block  L3  is  executed  some  positive  number  k  times  more, 

then  6  *  -k  <  0  and  block  L3  is  executed.  Thus  we  have  a^,  b, , 

a2*  b2*  etc*  as  above  except  that 


an/2* 


'1 


P0  +  k 


la2l 


P,/2-l  P,/2 

<  K  v  1  and  |  b^  I  <  K  y  *■ 


Hence 

follows  from  the  inductive  hypothesis  (for  p  *  Pg)  and  Lemma  1  ((v)  and 
(vi)). 


Now,  suppose  that  the  previous  occurrence  of  6  ■  0  occurred  in  block 
L5  (with  (a,b,p)  ■  (a^.b^Pg)  say)  rather  than  in  block  L2.  This 
can  only  be  the  case  if  bg  is  odd.  Then,  block  L2  is  executed  once, 
after  which  (6,a,b,p)  ■  (-k,a^ ,b^ ,p^ )  say,  and  for  some  k^,...,km 
we  have 

al  ‘  <au  ±  b0)/2k+1  ,  bj  -  bg. 

Pi  -  P0  +  k  +  1, 

"2  “  »2. 

|b2  I  <  | bx  | / +  |a1l(l-4“m)/3,  and 

P2  "  P1  +  k  +  m  * 


The  result  follows  as  in  case  (i)  above 
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Finally,  we  should  consider  an  occurrence  of  6  ■  0  at  statement  L4 
rather  than  at  L7,  with  the  previous  occurrence  of  6  •  0  at  L7,  and  show 
that  H  holds.  The  proof  of  this,  is  similar  to  case  (ii)  above,  using 
Lemma  1  (vii),  and  hence  is  omitted. 

This  completes  the  proof  of  Lemma  2,  by  induction  on  p  .  // 

Lemma  3 


Consider  Algorithm  P  defined  in  Figure  6.  Provided  the  initial 
condition 

I  :  {6  -  0,  |a  1  <  K/p,  |b|  <  K,  p  -  0} 

holds,  where  p  ■  (1+vT7)/8,  then  the  conditions 
JA  :  |b|  <  K  pp/2 

and 

J2  :  1  a |  _<  K  pp/2-1 

hold  whenever  execution  reaches  the  points  indicated  by  {J^}  and  {J2} 
respectively. 

Proof 

Consider  (J2  Is  similar).  Clearly  6^0.  If  p  *  0  the 
result  follows  from  I  ,  so  suppose  p  >  0  .  Then  L4  and  L7  must  have  been 
executed  with  6  ■  0  .  At  the  most  recent  such  execution  suppose  the 
value  of  (a,b,p)  was  (a',b',p').  By  Lemma  2,  |a' |  <  Kup  ^2_1 

|b' |  <  K  pP'/2  . 


\  ' 

_ A _ .  -k _ - 


and 


If  6  ■  0  (so  p  -  p' )  we  are  finished.  If  6  >  0  then  block  L6 
must  have  been  executed  6  times  since  p  -  p'  ,  so 

b  -  b'/2^  ,  p  -  p'  +  • 

Thus  |b|  <  K  li(p_6)/2  2“6 

<  K  pP''2  as  v“1/2  <  2  .  // 

We  are  now  ready  to  prove  the  main  result  of  this  section: 

Theorem  2 

(cnl  cells  as  defined  in  Appendix  A  are  sufficient  to  compute 

+  GCD(a,b)  for  any  n-bit  Integers  a  and  b  ,  where 

c  -  2/log-  *57-1  “3.1105  . 

*■2 

Proof 

The  result  is  trivial  if  a  or  b  is  zero,  so  suppose  both  a  and  b 
are  nonzero. 

Let  2k  (k  £  0)  be  the  largest  power  of  two  which  divides  both  a 

k  k 

and  b  .  Then  the  loop  L0  of  Algorithm  PM  reduces  (a,b)  to  (a/2  ,  b/2  ) 
and  its  systolic  implementation  uses  k  cells.  At  this  point 
6-0,  | a |  £  2n_k  »  I b |  _<  2n-lt  ,  and  at  least  one  of  a  and  b  is 
odd.  Subject  to  this  condition,  it  is  easy  to  see  that  the  "while"  loop 
LI  of  Algorithm  PM  is  equivalent  to  Algorithm  P  (Figure  6)  modulo 
interchanges  of  a  and  b  and  corresponding  sign  reversals  of  6  . 
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Furthermore,  the  variable  p  of  Algorithm  P  counts  the  number  of  cells 
required  in  che  systolic  implementation  (Appendix  A)  of  loop  LI  of 
Algorithm  PM. 

Algorithm  P  terminates  when  a  ■  0  or  b  ■  0  ;  immediately  before 
this  occurs  we  have  |a|  *  |b|  ■  1 G |  at  {J^}  or  {J2>  ,  where  G 
is  the  GCD  being  computed.  By  Lemma  3, 

1  <  I  G  |  <  2n_k  yP'2"1 

Thus  p/2-1  <  (n-k)/log2(l/y)  . 

The  total  number  of  cells  required  is 

p  +  k  <_  2(n-k)/log2(l/u)  +  k  +  1 

and  the  right  hand  side  attains  its  maximum  (over  k  ^  0)  of 
2n/log2(l/u)  +  1  when  k  *  0  .  Thus,  since  log2(l/y)  is  not  rational, 
the  result  follows.  // 


6.  A  lower  bound  on  the  number  of  systolic  cells 

The  following  theorem  shows  that  the  constant  c  in  Theorem  2  can  not 
be  reduced  below  3. 

Theorem  3 


3n-5  (n  mod  2)  of  the  cells  defined  in  Appendix  A  are  necessary  to 

compute  +  GCD(a,b)  for  certain  n-bit  positive  integers  a,  b  . 


Proof 


The  resale  is  trivial  if  n  £  2  ,  so  suppose  n  ^  3  .  If  n  is  even, 
take  a  *  3.2n-^  +  1,  b  »  3.2n~^  -  1  .  It  is  easily  verified  that 
3n-5  cells  are  necessary  and  sufficient  to  compute  +<JCD(a,b).  Similarly, 
if  n  is  odd,  take  a  *  3.2n~^  -  1,  b  ■  3.2n-^  +  I  .  It  is  easily 
verified  that  3n~4  cells  are  necessary  and  sufficient  in  this  case.  // 

The  best  possible  value  of  the  constant  c  in  Theorems  2  and  3  is 
unknown,  but  we  conjecture  that  it  is  3.  The  following  table  gives,  for 
2  n  18  ,  the  number  of  cells  (as  in  Appendix  A)  required  to  compute 
+CCD(a,b)  for  all  positive  n-bit  integers  (a,b),  and  an  example  (with 
minimal  max  (a,b))  for  which  the  worst  case  applies.  For  comparison  the 
bounds  of  Theorems  2  and  3  are  also  given  in  the  table. 
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n  (number 
of  bits) 

lower  bound 
(Theorem  3) 

cells  required 
in  worst  case 

upper  bound 
(Theorem  2) 

example  of 
worst  raco 

a_ 

b 

2 

1 

3 

7 

1 

3 

3 

5 

6 

10 

7 

5 

4 

7 

10 

13 

15 

13 

5 

11 

11 

16 

17 

23 

6 

13 

15 

19 

57 

47 

7 

17 

18 

22 

33 

125 

8 

19 

20 

25 

119 

213 

9 

23 

23 

28 

319 

349 

10 

25 

26 

32 

647 

693 

11 

29 

29 

35 

1535 

1537 

12 

31 

33 

38 

3847 

3829 

13 

35 

35 

41 

6143 

6145 

14 

37 

38 

44 

10257 

13651 

15 

41 

41 

47 

24575 

24577 

16 

43 

45 

50 

64229 

61519 

17 

47 

47 

53 

98303 

98305 

18 

49 

50 

56 

185487 

210061 

Table  1  :  Number 

of  cells  required 

for  n-bit  inputs 

I 


l 
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7.  A  possible  VLSI  implementation 

The  GCD  cell  defined  in  Appendix  A  is  a  finite-state  machine  whose 
state  is  determined  by  the  24  Boolean  variables  a,  b,  ...  •  Consequently, 

2  L 

it  has  2  states.  In  this  section  we  outline  one  possible  way  in  which 
the  GCD  cell  could  be  implemented  on  a  single  chip  using  current  nMOS 
technology  [13].  We  do  not  claim  that  this  is  the  best  way  to  implement 
the  GCD  cell,  but  it  does  have  the  virtue  of  simplicity. 

A  set  of  Boolean  functions  can  be  implemented  by  a  programmed  logic 
array  (PLA)  [13].  From  [10]  the  area  of  the  PLA  is  approximately 
A  -  64(p+7)(2n^  +  ng  +  6)  ,  where 

n^  »  number  of  input  variables  (or  their  complements), 
nQ  *  number  of  outputs, 

p  ■  number  of  distinct  product  terms  (i.e.  conjuncts) 
when  the  functions  are  written  in  disjunctive  normal  form  (DNF ) ,  and  X  is 
as  in  Mead  and  Conway  [13]. 

We  assume  that  Boolean  variables  a,  b,  ...  are  implemented  by  clocked 
registers  in  a  standard  manner  [13].  It  would  in  principle  be  possible 
to  implement  the  GCD  cell  as  a  single  large  PLA  with  n^.  *  ng  “  18  , 
p  3  83  .  However  it  is  possible  to  split  the  GCD  cell  into  several 
components  which  can  each  be  implemented  by  a  PLA  of  moderate  size.  The 
initial  block  of  assignment  statements  "aout  :■  a;  ...  ;  negout  :»  neg"  is 
easy  to  implement,  as  is  the  statement 


"wait  :■  (wait  or  start)  and  not  startodd" 
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The  remainder  of  the  cell  definition  has  the  form 
"if  El  then  Bl 
else  if  then 


else  B^" 

where  E^»  . E^  are  Boolean  expressions  in  the  6  variables 
startodd,  wait,  a,  b,  shift,  startoddout  and  B^,  ...,  are  blocks 

of  assignment  statements.  We  could  implement  each  of  .  by 

"slave"  PLAs  of  moderate  size.  These  PLAs  compute  in 

parallel  and  their  outputs  are  selected  by  a  "master"  PLA  which  computes 
the  5  (mutually  exclusive)  functions  E1 ,  "-E^.  ^E^.E^, 

~E^.  ~E2*  ^E^.E^,  ~E^«  ~^3*  *  Tbe  number  of 

inputs  (nj),  outputs  (n^)  and  product  terms  (p)  for  the  PLAs  M,  P^, 

...,  Pc.  are  given  in  Table  2. 


PLA 

nI 

n0 

P 

M 

6 

5 

7 

P1 

5 

7 

6 

P2 

1 

1 

1 

P3 

9 

5 

8 

P4 

5 

8 

9 

P5 

9 

4 

12 

Table  2  :  Parameters 

of  PLAs 

for  GCD 

cell  implementation 

I 
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2 

The  total  area  of  the  PLAs  in  Table  2  is  about  20000  X  .  By  way  of 

comparison,  the  "brute-force"  approach  (using  a  single  PLA)  has  area  about 
2 

350000  X  .  These  estimates  neglect  1/0  pads,  routing  between  the  PLAs, 

7  2 

etc.  With  current  technology  the  area  of  a  chip  can  easily  be  10  X  f 
so  it  should  be  possible  to  implement  many  GCD  cells  on  a  single  chip. 


7.  Conclusion 


We  have  shown  that  the  greatest  common  divisor  of  two  n-bit  integers 
(given  in  the  usual  binary  representation)  can  be  computed  in  time  0(n) 
on  a  linear  array  of  0(n)  identical  systolic  cells,  each  of  which  is  a 
finite-state  machine  which  could  be  implemented  on  (part  of)  a  VLSI  chip. 
Thus,  special-purpose  hardware  for  integer  GCD  computation  could  easily 
be  built.  Since  GCD  computation  is  the  most  time-consuming  operation  when 
rational  arithmetic  is  performed,  such  hardware  could  be  worthwhile  for 
applications  involving  exact  rational  arithmetic,  e.g.  symbolic 
computation. 


8.  Postscript  (April  1984) 

A  prototype  systolic  integer  GCD  cell  was  implemented  on  a  multiproject 
chip  (coordinated  by  the  CSIRO  VLSI  Program,  Adelaide,  Australia)  in 
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November  1983,  using  nMOS  technology  with  Mead  and  Conway  design  rules  [13] 
and  X  m  2.5  micron.  For  the  sake  of  variety  and  to  reduce  power 
consumption,  our  implementation  used  "blue-green  function  blocks"  (Plate 
7(b)  of  [13])  instead  of  PLAs.  To  minimize  routing  problems,  we  used  only 
two  function  blocks  instead  of  the  six  suggested  in  Section  7.  One 
function  block,  the  "control"  function  block,  computes  the  functions  E^, 

~^1  ,E'2*  *  *  *  ,~^1*~^2*~^3*~^4  ^see  Sect*on  7)  and  some  other 

Boolean  functions  of  its  inputs,  e.g.  a(+)b.  The  other  function  block, 

the  "main"  function  block,  computes  all  other  required  Boolean  functions. 

To  implement  the  finite  state  machine,  18  outputs  of  the  main  function 
block  are  fed  back  (through  clocked  registers)  to  the  control  function 
block.  To  simplify  testing  we  used  static  rather  than  dynamic  registers. 

The  total  size  of  the  cell  is  688 X  by  1022X.  About  40  percent  of  the  area 
is  occupied  by  the  two  function  blocks,  the  remainder  being  used  for  I/O 
pads,  registers,  clock  drivers,  etc.  The  layout  is  illustrated  in  Figure  7. 
Preliminary  testing  indicates  that  the  prototype  performs  as  expected. 
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Figure  7  :  Prototype  systolic  Integer  GCD  cell 
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For  applications  we  usually  want  to  compute  the  extended  GCD.  It  Is 
possible  to  do  this  using  a  straightforward  extension  of  the  systolic  cell 
defined  in  Appendix  A:  see  Bojanczyk.  and  Brent  [16]. 

Recently  Purdy  [18]  has  suggested  a  different  way  to  compute  integer 
GCDs  in  linear  time.  However,  although  Purdy's  algorithm  is  linear  on 
average,  its  worst-case  behaviour  is  quadratic.  Consequently,  our  systolic 
algorithm,  which  is  linear  even  in  the  worst  case  (by  Theorem  1),  seems 
preferable. 

I 
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{The  language  used  is  Pascal  wich  some  trivial  extensions  and  declarations 
omitted.  See  Figure  5  for  I/O  ports} 


aout  :■  a;  a  :*«in;  {standard  transfers} 

bout  :■  b;  b  :*  bin; 

startout  :*  start;  start  startin; 

startoddout  :■  startodd;  startodd  :■  startoddin; 

epsout  :*  eps2;  eps2  :■  eps;  eps  :»  epsin;  {delay  here} 
negout  :•  neg; 


wait  :■  (wait  or  start)  and  not  startodd;  {wait  for  nonzero  bit} 


if  startodd  or  (wait  and  (a  or  b))  then 
begin 

eps  :■  eps  or  wait; 
eps2  :■  0;  {0  =  false,  1  =  true} 
neg  :*  negin  and  not  wait; 
startodd  :■  1; 

wait  :*  0;  {end  of  waiting  for  a  nonzero  bit} 

swap  :»  not  a; 

shift  :■  not  (a  and  b) 

end 

else  if  wait  then  epsout  :»  eps2  {normal  speed) 
else  if  shift  then  {shift  b  faster  than  a,  may  also  swap} 
begin 

aout  :*  (bout  and  swap)  or  (aout  and  not  swap);  {normal  speed} 

bout  :«  (a  and  swap)  or  (b  and  not  swap);  {fast  speed} 

epsout  :■  (eps  and  neg)  or  (epsout  and  not  neg); 

neg  :■  neg  and  not  (eps  and  startoddout);  {<5  may  become  zero} 

negout  :■  neg 

end 

else  if  startoddout  then 


begin 

epsout  :■  eps2;  {normal  speed} 

swap  :*  not  neg; 

neg  :■  neg  or  not  eps2;  {5  — | 5 1 } 

negout  :*  neg; 

aout  :*  aout  or  swap;  {swap  implies  b} 

bout  :■  0;  {and  new  b  is  even} 

carry  :■  a©b;  {may  be  borrow  or  carry;  @  is  exclusive  or} 

minus  :»  not  carry  {l  iff  we  form  (b  -  a)  div  2} 

end 

else  {not  startoddout} 
begin 

epsout  :■  eps2;  {normal  speed} 

aout  :■  (bout  and  swap)  or  (aout  and  not  swap);  {normal  speed} 
bout  :•  a  ® b ©  carry;  {fast  speed} 

carry  :•  majority  (b,  carry,  a  ©minus)  {majority  true  if  2  or  3  of 


its  arguments  are  true} 


end 


[cnl  cells  as  defined  in  Appendix  A  are  sufficient  to  compute 
+  GCD(a,b)  for  any  n-bit  integers  a  and  b  ,  where 
c  •  2/log2  [*HlL]  “3.1105  . 

Proof 

The  result  is  trivial  if  a  or  b  is  zero,  so  suppose  both  a  and  b 
are  nonzero. 

Let  2r  (k  >.  0)  be  the  largest  power  of  two  which  divides  both  a 
and  b  .  Then  the  loop  L0  of  Algorithm  PM  reduces  (a,b)  to  (a/2k,  b/2k) 
and  its  systolic  implementation  uses  k  cells.  At  this  point 
6-0,  | a |  2n-k  ,  I b I  £  2n-k  ,  and  at  least  one  of  a  and  b  is 
odd.  Subject  to  this  condition,  it  is  easy  to  see  that  the  "while"  loop 
LI  of  Algorithm  PM  is  equivalent  to  Algorithm  P  (Figure  6)  modulo 
interchanges  of  a  and  b  and  corresponding  sign  reversals  of  6  . 


